Portugal, in 2006, had an early school leaving rate of nearly 40% among 18-24 year olds, compared with an EU average of 15% (Eurostat 2007). Various analyses and interventions have therefore targeted school success among Portuguese adolescents in an effort to improve education overall. Mathematics performance, in particular, is a weak point for Portuguese students.
We therefore analyze student data from two Portuguese secondary schools. The dataset records the mathematics course grade as the outcome along with 30 predictors, ranging from demographic factors and family situation to student habits and extracurricular involvement. Data were collected from 395 students aged 15-22 years (Cortez & Silva 2008).
Portuguese schools use a five-level grade classification system. With final scores measured on a 0-20 scale, the categories are as follows: Excellent (16-20), Good (14-15), Satisfactory (12-13), Sufficient (10-11), and Fail (0-9). For the purposes of this analysis, however, we model the exact score rather than the score category.
In this analysis, we wish to predict student success in secondary school math as a result of two types of factors:
The potential outcomes of this analysis are:
The initial covariates in the dataset are:
And the response variables are:
For this analysis, we will be considering the G3 final grade as our response.
## # A tibble: 0 x 3
## # … with 3 variables: G1 <int>, G2 <int>, G3 <int>
Let’s remove the students with a final grade of 0, as that doesn’t make much practical sense, especially since all of these students have grades higher than 0 in the first and/or second period.
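A minimal sketch of this filtering step (toy data here; the real dataset uses these same `G1`/`G2`/`G3` column names):

```r
# Toy stand-in for the student data: period grades G1/G2 and final grade G3.
student <- data.frame(G1 = c(10, 12, 8),
                      G2 = c(11, 13, 9),
                      G3 = c(10, 0, 9))

# Drop students whose final grade is exactly 0.
student_d <- subset(student, G3 > 0)
nrow(student_d)  # one row removed
```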
**BB: agree with this approach; if we are feeling adventurous we could treat those 0’s as missing and impute them, but let’s not overcomplicate things yet.**

```r
# MICE code if we decide to impute
library(mice)
student <- read.table("student-mat.csv", sep = ";", header = TRUE)
student <- as.data.frame(student)
student$G3[student$G3 == 0] <- NA
md.pattern(as.data.frame(student))  # 38 students missing G3, almost 10% of data is missing
imputed_data <- mice(student, m = 5, maxit = 50, method = "pmm", seed = 500)
summary(imputed_data)
imputed_data$imp$G3
imputed <- complete(imputed_data)
```
Since the variable “absences” has a much greater range of values than most of the other variables, and since the data become extremely sparse beyond 22 absences, we decided to group 23-93 absences into the single value “23.”
“Absences” still has a much greater spread of values than the other variables, so we will try binning it to see how that affects our models.
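Both transformations can be sketched in base R; the break points in the `cut()` call below are illustrative, not the ones actually used for the report:

```r
# Toy absence counts; the real data uses student$absences.
absences <- c(0, 2, 8, 22, 30, 93)

# Group 23-93 absences into the single value 23.
capped <- pmin(absences, 23)

# Bin the capped values (break points here are an assumption for illustration).
binned <- cut(capped, breaks = c(-1, 0, 5, 15, 23),
              labels = c("none", "low", "medium", "high"))
table(binned)
```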
Based on these plots, we decided to remove the variable “guardian” from our dataset since it didn’t seem to have any influence on the outcome.
We also observed that students in this dataset were from two different schools. Using school in our prediction of math course success, although it may increase prediction accuracy, would make our analysis less generalizable to the overall population of students. Since the overarching goal of this analysis is to identify factors that systematically affect secondary school student success in math courses, we decided to remove school as a predictor variable.
To get an initial sense of influential variables, we can look at all variables retained up to this point and examine fit statistics. This bit of code considers all possible models up to nvmax and reports the best model under each fit criterion; for example, with nvmax of 30 it produces the best model with 1 covariate, 2 covariates, and so on up to the full model.

### Looking at coefficients from best subset above based on BIC

Notice that while the “best” option based on BIC has 8 covariates, we give up very little BIC by using only 7. We have presented both sets of coefficients here, and we generally prefer the more parsimonious model when all other criteria are the same.
## (Intercept) Mjobhealth Mjobservices Fjobteacher failures
## 14.6123286 1.7919419 1.4684797 2.0295712 -1.1000624
## schoolsupyes goout health absences
## -2.3415181 -0.4729677 -0.2581331 -0.4284993
## (Intercept) Mjobhealth Mjobservices Fjobteacher failures
## 13.6819891 1.6852313 1.3837170 2.0288848 -1.1282540
## schoolsupyes goout absences
## -2.3052617 -0.4655962 -0.4141869
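The search above used `leaps::regsubsets`; the following is a minimal base-R sketch of the same idea (exhaustive search over predictor subsets, best model by BIC) on toy data, so only built-in functions are needed:

```r
# Toy data: only x1 is truly informative for y.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)

preds <- c("x1", "x2", "x3")
best <- NULL
# Exhaustive search: every subset of every size, keep the lowest BIC.
for (k in seq_along(preds)) {
  for (s in combn(preds, k, simplify = FALSE)) {
    fit <- lm(reformulate(s, response = "y"), data = d)
    if (is.null(best) || BIC(fit) < best$bic) best <- list(vars = s, bic = BIC(fit))
  }
}
best$vars  # the informative predictor is selected
```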
VIF and Tolerance
| Variable | VIF | Tolerance |
|---|---|---|
| Fjobother | 6.342636 | 0.1576631 |
| Fjobservices | 5.576032 | 0.1793390 |
| Mjobteacher | 3.294431 | 0.3035425 |
| Mjobservices | 3.086430 | 0.3239989 |
| Mjobother | 2.919526 | 0.3425214 |
| Medu | 2.799083 | 0.3572598 |
| Fjobteacher | 2.766049 | 0.3615265 |
| Walc | 2.443405 | 0.4092649 |
| Mjobhealth | 2.423100 | 0.4126945 |
| Fjobhealth | 2.275655 | 0.4394340 |
Displayed above are the covariates with the highest Variance Inflation Factors.
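The table above was likely produced by a package such as `car`; VIF can also be computed by hand, since VIF_j = 1/(1 - R²_j) where R²_j comes from regressing predictor j on all the others. A base-R sketch on toy data:

```r
# Toy predictors: x2 is deliberately correlated with x1, x3 is independent.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# For each predictor, regress it on the rest and invert (1 - R^2).
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)  # the correlated pair x1/x2 shows inflated VIFs
```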
EigenAnalysis for the Correlation and Scaled SSCP Matrices
| | Eigenvalue | Condition.Index |
|---|---|---|
| 35 | 0.1042497 | 5.546255 |
| 36 | 0.0685997 | 6.837166 |
| Eigenvalue | Condition.Index |
|---|---|
| 0.0063958 | 57.15686 |
From the Correlation Matrix, the largest Condition Index is still very small, which means no collinearity needs to be addressed there. From the Scaled SSCP Matrix, however, there are two Condition Indices over 30 (the largest is shown above).
## PC36 PC37
## [1,] -0.54238432 -0.768156681
## [2,] 0.01500628 0.001119414
## [3,] -0.01588949 0.052591238
## [4,] 0.04850298 0.010901229
## [5,] -0.01501923 0.024552007
## [6,] 0.06871137 0.031398323
## [7,] 0.36810775 -0.223722976
## [8,] -0.02014247 0.047035204
## [9,] -0.15226341 0.102934232
## [10,] -0.18494245 0.141719926
## [11,] -0.19775873 0.138712109
## [12,] -0.20887103 0.139184678
## [13,] -0.11254029 0.125983102
## [14,] -0.34217703 0.393506844
## [15,] -0.22943971 0.277311183
## [16,] -0.14535872 0.157940062
## [17,] -0.02812198 0.031426805
Above are the last two PCs of the Scaled SSCP Matrix for the 1st-17th covariates (the remaining covariates have values close to 0). We can see that the 8th and 11th-17th covariates may be collinear with the intercept. These covariates are listed below.
## [1] "Medu" "Mjobother" "Mjobservices" "Mjobteacher"
## [5] "Fjobhealth" "Fjobother" "Fjobservices" "Fjobteacher"
Many of these are factor variables whose reference levels are absorbed into the intercept, so it makes sense that they might be collinear with it. We will not remove any of them.
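The condition indices used above come from an eigenanalysis: decompose the correlation (or scaled SSCP) matrix and take the square root of the largest eigenvalue over each eigenvalue; indices over 30 suggest problematic collinearity. A base-R sketch on toy data:

```r
# Toy design: x2 is a near-duplicate of x1, x3 is independent noise.
set.seed(3)
x1 <- rnorm(100)
X  <- cbind(x1 = x1, x2 = x1 + rnorm(100, sd = 0.1), x3 = rnorm(100))

# Condition indices from the correlation matrix's eigenvalues.
ev <- eigen(cor(X))$values
cond_index <- sqrt(max(ev) / ev)
round(cond_index, 1)  # the near-duplicate pair inflates the last index
```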
There are many types of models to consider when building a regression classifier, such as Multiple Linear Regression, ANOVA, Linear Mixed Models, and Logistic/Poisson Regression. Classically, Linear Mixed Models are used when observations are related, and Logistic and Poisson Regression are used for binary and count outcomes, respectively. Since we assume our observations (the students) are independent, and the outcome variable (final grade) is numeric, we narrow our model search to Multiple Linear Regression and ANOVA. Since, at least to start, we have many factor-type covariates, we begin by exploring Multiple Linear Regression models and their fit to our data.
*BB: since categorical variables are expanded into several dummy variables, we have more than 30 betas. I count 39 below.*
The first model we consider is a full model of all covariates possible and no interaction terms included. The model can be written as follows:
\[y = \beta_0 + \beta_1X_1 + ... + \beta_{36}X_{36} + \epsilon\] where \(\beta_i\) represents the regression coefficient for covariate \(X_i\), and \(\epsilon\) is the error term. There are more covariates than expected because each factor variable is recreated as a series of “dummy” variables. If a factor has \(n\) levels, then there will be \(n-1\) dummy variables, and therefore \(n-1\) parameters, corresponding to that covariate.
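The dummy-variable expansion can be seen directly with `model.matrix`. For example, the mother's-job factor in this dataset has five levels (at_home, health, other, services, teacher); the reference level at_home is absorbed into the intercept, leaving four indicator columns:

```r
# A 5-level factor, as with the Mjob variable in the student data.
Mjob <- factor(c("at_home", "health", "other", "services", "teacher"))

# The design matrix: intercept plus (5 - 1) = 4 dummy columns.
mm <- model.matrix(~ Mjob)
colnames(mm)
ncol(mm)  # 5 columns: intercept + 4 dummies
```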
##
## Call:
## lm(formula = G3 ~ ., data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5064 -1.8720 0.0293 1.8227 6.9600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.61184 1.74233 7.238 3.39e-12 ***
## sexM 0.57764 0.35502 1.627 0.10471
## age -0.18458 0.14030 -1.316 0.18923
## addressU 0.48945 0.40361 1.213 0.22615
## famsizeLE3 0.24556 0.34389 0.714 0.47571
## PstatusT -0.24239 0.50396 -0.481 0.63087
## Medu 0.17054 0.22593 0.755 0.45090
## Fedu 0.13734 0.19477 0.705 0.48125
## Mjobhealth 1.14310 0.80393 1.422 0.15603
## Mjobother -0.58096 0.52655 -1.103 0.27071
## Mjobservices 0.84323 0.58848 1.433 0.15287
## Mjobteacher -0.71277 0.74734 -0.954 0.34094
## Fjobhealth -0.63368 1.01710 -0.623 0.53371
## Fjobother -0.63043 0.74669 -0.844 0.39913
## Fjobservices -0.79711 0.77579 -1.027 0.30497
## Fjobteacher 1.18464 0.94423 1.255 0.21053
## reasonhome 0.26472 0.39912 0.663 0.50765
## reasonother -0.09380 0.57353 -0.164 0.87019
## reasonreputation 0.13815 0.41101 0.336 0.73699
## traveltime 0.02845 0.23921 0.119 0.90542
## studytime 0.51548 0.20779 2.481 0.01362 *
## failures -0.82301 0.25773 -3.193 0.00155 **
## schoolsupyes -2.42939 0.46632 -5.210 3.40e-07 ***
## famsupyes -0.63394 0.33989 -1.865 0.06308 .
## paidyes -0.44278 0.33601 -1.318 0.18852
## activitiesyes 0.12753 0.32141 0.397 0.69179
## nurseryyes -0.27189 0.39819 -0.683 0.49522
## higheryes 0.39340 0.85823 0.458 0.64699
## internetyes 0.61575 0.44649 1.379 0.16883
## romanticyes -0.09634 0.34193 -0.282 0.77832
## famrel 0.12137 0.17748 0.684 0.49457
## freetime 0.10646 0.16699 0.638 0.52424
## goout -0.45766 0.16638 -2.751 0.00628 **
## Dalc 0.05727 0.23179 0.247 0.80502
## Walc -0.15349 0.17833 -0.861 0.39005
## health -0.23289 0.11382 -2.046 0.04156 *
## absences -0.36947 0.14796 -2.497 0.01303 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.787 on 320 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.2542
## F-statistic: 4.371 on 36 and 320 DF, p-value: 2.751e-13
**The adjusted R² of 0.254 is disappointingly low.**
In order to determine the validity of this model, we must test the assumptions of linear regression. These assumptions include homogeneity of variance and Gaussian errors.
## RESID ABS PRED
## 1 2.698348 2.698348 12.039974
## 2 2.544754 2.544754 10.371589
## 3 2.516308 2.516308 11.339381
## 4 -2.454889 2.454889 11.506401
## 5 2.421202 2.421202 11.522224
## 6 2.419423 2.419423 11.716532
## 7 2.398537 2.398537 12.654729
## 8 -2.324983 2.324983 12.230358
## 9 2.276336 2.276336 12.075219
## 10 2.260274 2.260274 13.063600
## 11 -2.241573 2.241573 11.990685
## 12 2.203381 2.203381 8.205372
## 13 -2.125594 2.125594 11.673253
## 14 2.118535 2.118535 12.430768
## 15 2.057514 2.057514 12.684111
## 16 -2.048861 2.048861 11.302894
## 17 -2.026352 2.026352 12.250541
## 18 -2.022032 2.022032 11.184054
Since these largest studentized residuals all exceed 2 in absolute value, it may be of interest to examine them in further detail. **This part might fit better in the outlier detection section.**
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.29059, p-value = 0.6087
From the output above, since the p-value of 0.6087 is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis that the residuals are normally distributed.
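These diagnostics can be sketched in base R on a built-in dataset (the report used the Anderson-Darling test from the `nortest` package; `shapiro.test` is used here as a base-R stand-in):

```r
# Fit a simple model on a built-in dataset to illustrate the diagnostics.
fit <- lm(dist ~ speed, data = cars)

# Studentized residuals; |r| > 2 flags observations worth a closer look.
r <- rstudent(fit)
r[abs(r) > 2]

# Normality check on the residuals (stand-in for Anderson-Darling).
shapiro.test(residuals(fit))$p.value
```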
Since, from the collinearity analysis, several variables were significantly collinear with the intercept, we next fit a model identical to the full model but with the intercept removed.
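In R's formula interface, appending `- 1` (or `+ 0`) drops the intercept, as in the `lm(G3 ~ . - 1, data = student_d)` call below; a minimal sketch on a built-in dataset:

```r
# Fit without an intercept: '- 1' removes it from the formula.
fit_no_int <- lm(mpg ~ wt - 1, data = mtcars)

# No "(Intercept)" term appears among the coefficients.
"(Intercept)" %in% names(coef(fit_no_int))  # FALSE
```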
##
## Call:
## lm(formula = G3 ~ . - 1, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5506 -1.8695 0.1517 1.8989 8.0781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sexM 0.754517 0.381480 1.978 0.04880 *
## age 0.014509 0.148178 0.098 0.92206
## addressU 1.105820 0.424932 2.602 0.00969 **
## famsizeLE3 0.496159 0.368521 1.346 0.17914
## PstatusT 0.768732 0.521541 1.474 0.14147
## Medu 0.317803 0.242360 1.311 0.19070
## Fedu 0.332139 0.207772 1.599 0.11090
## Mjobhealth 0.780292 0.864211 0.903 0.36726
## Mjobother -0.506167 0.567026 -0.893 0.37270
## Mjobservices 0.668814 0.633313 1.056 0.29174
## Mjobteacher -1.176189 0.801990 -1.467 0.14347
## Fjobhealth 1.128215 1.063670 1.061 0.28963
## Fjobother 1.261640 0.753360 1.675 0.09497 .
## Fjobservices 1.188711 0.781597 1.521 0.12928
## Fjobteacher 2.942862 0.982784 2.994 0.00296 **
## reasonhome 0.452118 0.428985 1.054 0.29271
## reasonother 0.561494 0.610001 0.920 0.35801
## reasonreputation 0.277318 0.442207 0.627 0.53102
## traveltime 0.506018 0.247660 2.043 0.04185 *
## studytime 0.895230 0.216552 4.134 4.55e-05 ***
## failures -0.619055 0.275930 -2.244 0.02554 *
## schoolsupyes -2.093990 0.499782 -4.190 3.61e-05 ***
## famsupyes -0.610801 0.366076 -1.669 0.09619 .
## paidyes -0.542762 0.361601 -1.501 0.13434
## activitiesyes 0.050217 0.345997 0.145 0.88469
## nurseryyes 0.120396 0.424898 0.283 0.77709
## higheryes 3.253194 0.820618 3.964 9.08e-05 ***
## internetyes 0.795038 0.480166 1.656 0.09875 .
## romanticyes -0.188022 0.368039 -0.511 0.60979
## famrel 0.464018 0.184239 2.519 0.01227 *
## freetime 0.389599 0.174863 2.228 0.02657 *
## goout -0.440626 0.179184 -2.459 0.01446 *
## Dalc -0.005198 0.249484 -0.021 0.98339
## Walc -0.031654 0.191219 -0.166 0.86863
## health -0.088728 0.120704 -0.735 0.46282
## absences -0.260683 0.158545 -1.644 0.10111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.002 on 321 degrees of freedom
## Multiple R-squared: 0.9434, Adjusted R-squared: 0.937
## F-statistic: 148.6 on 36 and 321 DF, p-value: < 2.2e-16
**The adjusted R² jumps to 0.937. Note, however, that R² for a no-intercept model is computed against a baseline of zero rather than the mean response, so it is not directly comparable to the R² of models that include an intercept.**
## RESID ABS PRED
## 1 2.907693 2.907693 10.921897
## 2 2.535449 2.535449 6.816460
## 3 2.531060 2.531060 9.895896
## 4 2.475705 2.475705 11.073511
## 5 2.449680 2.449680 11.011177
## 6 2.424527 2.424527 12.089761
## 7 -2.268277 2.268277 12.550590
## 8 2.233351 2.233351 11.676215
## 9 -2.215128 2.215128 11.334789
## 10 2.168795 2.168795 11.738572
## 11 2.100816 2.100816 7.033648
## 12 -2.080427 2.080427 12.806819
## 13 2.079162 2.079162 13.036056
## 14 -2.050184 2.050184 11.897140
## 15 -2.044874 2.044874 11.702064
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.33791, p-value = 0.503
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU Mjobhealth
## 13.1670 0.6631 -0.2024 0.6295 1.7986
## Mjobservices Fjobteacher studytime failures schoolsupyes
## 1.4065 2.1208 0.5661 -0.9418 -2.3196
## famsupyes goout health absences
## -0.6310 -0.4774 -0.2617 -0.3240
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU Mjobhealth
## 13.1670 0.6631 -0.2024 0.6295 1.7986
## Mjobservices Fjobteacher studytime failures schoolsupyes
## 1.4065 2.1208 0.5661 -0.9418 -2.3196
## famsupyes goout health absences
## -0.6310 -0.4774 -0.2617 -0.3240
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + famsizeLE3 + PstatusT +
## Medu + Fedu + Mjobhealth + Mjobother + Mjobservices + Mjobteacher +
## Fjobhealth + Fjobother + Fjobservices + Fjobteacher + reasonhome +
## reasonother + reasonreputation + traveltime + studytime +
## failures + schoolsupyes + famsupyes + paidyes + activitiesyes +
## nurseryyes + higheryes + internetyes + romanticyes + famrel +
## freetime + goout + Dalc + Walc + health + absences, data = student_d)
##
## Coefficients:
## (Intercept) sexM age addressU
## 12.61184 0.57764 -0.18458 0.48945
## famsizeLE3 PstatusT Medu Fedu
## 0.24556 -0.24239 0.17054 0.13734
## Mjobhealth Mjobother Mjobservices Mjobteacher
## 1.14310 -0.58096 0.84323 -0.71277
## Fjobhealth Fjobother Fjobservices Fjobteacher
## -0.63368 -0.63043 -0.79711 1.18464
## reasonhome reasonother reasonreputation traveltime
## 0.26472 -0.09380 0.13815 0.02845
## studytime failures schoolsupyes famsupyes
## 0.51548 -0.82301 -2.42939 -0.63394
## paidyes activitiesyes nurseryyes higheryes
## -0.44278 0.12753 -0.27189 0.39340
## internetyes romanticyes famrel freetime
## 0.61575 -0.09634 0.12137 0.10646
## goout Dalc Walc health
## -0.45766 0.05727 -0.15349 -0.23289
## absences
## -0.36947
Backward and both-direction stepwise selection give the same variables; forward selection retains all of them.
Note that the stepwise procedure does not give exactly the same answer when started from model1 (with intercept) as when started from model2 (without).
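The backward/both/forward comparison can be sketched with base R's `step()` (which is AIC-based), shown here on a built-in dataset rather than `student_d`:

```r
# Full model on a built-in dataset.
fit_full <- lm(mpg ~ ., data = mtcars)

# Backward elimination and both-direction stepwise selection.
fit_back <- step(fit_full, direction = "backward", trace = 0)
fit_both <- step(fit_full, direction = "both", trace = 0)
formula(fit_back)  # a subset of the original predictors
```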
##
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices +
## Fjobteacher + studytime + failures + schoolsupyes + famsupyes +
## goout + health + absences, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9070 -1.9329 -0.1166 1.8534 7.3467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.1670 0.8580 15.347 < 2e-16 ***
## sexM 0.6631 0.3151 2.104 0.036096 *
## age -0.2024 0.1294 -1.564 0.118670
## addressU 0.6295 0.3612 1.743 0.082292 .
## Mjobhealth 1.7986 0.5323 3.379 0.000811 ***
## Mjobservices 1.4065 0.3444 4.084 5.51e-05 ***
## Fjobteacher 2.1208 0.5664 3.745 0.000212 ***
## studytime 0.5661 0.1880 3.012 0.002788 **
## failures -0.9418 0.2358 -3.994 7.94e-05 ***
## schoolsupyes -2.3196 0.4464 -5.197 3.49e-07 ***
## famsupyes -0.6310 0.3116 -2.025 0.043653 *
## goout -0.4774 0.1366 -3.494 0.000538 ***
## health -0.2617 0.1060 -2.468 0.014074 *
## absences -0.3240 0.1328 -2.440 0.015181 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.743 on 343 degrees of freedom
## Multiple R-squared: 0.3042, Adjusted R-squared: 0.2778
## F-statistic: 11.53 on 13 and 343 DF, p-value: < 2.2e-16
So far this model gives us the best adjusted R² and the most significant variables.
## RESID ABS PRED
## 1 2.728324 2.728324 11.653339
## 2 2.714544 2.714544 11.763080
## 3 2.594468 2.594468 10.103717
## 4 2.493234 2.493234 12.314119
## 5 2.350334 2.350334 11.859641
## 6 2.332082 2.332082 11.696637
## 7 2.254386 2.254386 11.999121
## 8 2.196549 2.196549 12.233189
## 9 -2.182972 2.182972 11.906954
## 10 -2.182586 2.182586 10.906406
## 11 2.128817 2.128817 12.355736
## 12 -2.100461 2.100461 11.622068
## 13 -2.052568 2.052568 11.514590
## 14 2.048496 2.048496 8.495680
## 15 2.035560 2.035560 7.549169
## 16 2.005830 2.005830 10.601775
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.55571, p-value = 0.1506
Of all possible interaction terms, several were significant. Among those, the most interpretable are studytime:schoolsupyes, studytime:famsupyes, studytime:goout, and schoolsupyes:goout. Interestingly, the effect of studytime appears to be reduced when a student goes out a lot; perhaps fatigue reduces the effectiveness of studying.
##
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher +
## studytime + failures + schoolsupyes + famsupyes + paidyes +
## internetyes + goout + health + absences + studytime:schoolsupyes +
## studytime:famsupyes + studytime:goout + schoolsupyes:goout,
## data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3845 -1.7486 -0.0282 1.8463 7.2936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9712 1.3878 7.906 3.78e-14 ***
## sexM 0.5683 0.3133 1.814 0.070602 .
## Mjobhealth 1.8297 0.5264 3.476 0.000576 ***
## Mjobservices 1.4148 0.3406 4.154 4.14e-05 ***
## Fjobteacher 2.0923 0.5615 3.727 0.000227 ***
## studytime 1.4287 0.5851 2.442 0.015131 *
## failures -1.1644 0.2323 -5.012 8.70e-07 ***
## schoolsupyes 3.1224 1.6756 1.863 0.063259 .
## famsupyes -1.5353 0.8059 -1.905 0.057613 .
## paidyes -0.4169 0.3101 -1.344 0.179735
## internetyes 0.6524 0.4080 1.599 0.110768
## goout 0.2661 0.3562 0.747 0.455607
## health -0.2340 0.1052 -2.225 0.026749 *
## absences -0.3547 0.1313 -2.702 0.007231 **
## studytime:schoolsupyes -1.4887 0.5333 -2.791 0.005545 **
## studytime:famsupyes 0.5053 0.3740 1.351 0.177549
## studytime:goout -0.3400 0.1653 -2.057 0.040417 *
## schoolsupyes:goout -0.7082 0.3604 -1.965 0.050213 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.708 on 339 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.296
## F-statistic: 9.804 on 17 and 339 DF, p-value: < 2.2e-16
## RESID ABS PRED
## 1 2.767414 2.767414 11.70644
## 2 2.617816 2.617816 12.03399
## 3 2.550088 2.550088 12.30526
## 4 -2.450654 2.450654 11.38453
## 5 2.406855 2.406855 11.82766
## 6 2.371032 2.371032 10.73981
## 7 2.347761 2.347761 11.75319
## 8 2.327249 2.327249 11.88466
## 9 2.318470 2.318470 7.90480
## 10 -2.288029 2.288029 12.10269
## 11 -2.231194 2.231194 10.94818
## 12 -2.121865 2.121865 11.65760
## 13 2.037154 2.037154 12.63631
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.46134, p-value = 0.2578
The normality assumption is justified for model.int (Anderson-Darling p-value = 0.2578).
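For reference, interaction terms like the ones in model.int are specified in R with `:` (the product term alone) or `*` (product term plus main effects); a sketch on a built-in dataset:

```r
# 'wt * hp' expands to wt + hp + wt:hp.
fit_int <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_int))  # includes the "wt:hp" interaction coefficient
```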
##
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher +
## studytime + failures + schoolsupyes + famsupyes + goout +
## health + absences, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8085 -1.9348 -0.1866 1.9893 7.2760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.3782 0.7956 16.816 < 2e-16 ***
## sexM 0.6779 0.3164 2.143 0.032841 *
## Mjobhealth 1.9787 0.5307 3.729 0.000225 ***
## Mjobservices 1.5156 0.3436 4.411 1.38e-05 ***
## Fjobteacher 2.1484 0.5694 3.773 0.000190 ***
## studytime 0.5406 0.1888 2.863 0.004453 **
## failures -1.0778 0.2290 -4.706 3.66e-06 ***
## schoolsupyes -2.1153 0.4326 -4.890 1.55e-06 ***
## famsupyes -0.5913 0.3120 -1.895 0.058899 .
## goout -0.4831 0.1365 -3.539 0.000457 ***
## health -0.2596 0.1065 -2.437 0.015334 *
## absences -0.3581 0.1314 -2.725 0.006755 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.76 on 345 degrees of freedom
## Multiple R-squared: 0.2914, Adjusted R-squared: 0.2688
## F-statistic: 12.9 on 11 and 345 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures +
## schoolsupyes + goout, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5735 -1.8793 -0.0725 1.9254 7.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.0704 0.4698 27.819 < 2e-16 ***
## Mjobhealth 1.7801 0.5422 3.283 0.001130 **
## Mjobservices 1.3399 0.3522 3.804 0.000168 ***
## Fjobteacher 2.0188 0.5860 3.445 0.000640 ***
## failures -1.2676 0.2301 -5.508 7.05e-08 ***
## schoolsupyes -2.2548 0.4384 -5.143 4.51e-07 ***
## goout -0.4989 0.1405 -3.552 0.000435 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.855 on 350 degrees of freedom
## Multiple R-squared: 0.231, Adjusted R-squared: 0.2178
## F-statistic: 17.52 on 6 and 350 DF, p-value: < 2.2e-16
## RESID ABS PRED
## 1 2.814574 2.814574 11.07460
## 2 2.630448 2.630448 11.57355
## 3 2.453589 2.453589 12.07249
## 4 -2.334139 2.334139 12.57143
## 5 -2.323367 2.323367 11.57355
## 6 2.281179 2.281179 10.57566
## 7 -2.277173 2.277173 12.41453
## 8 2.270601 2.270601 11.57355
## 9 2.270601 2.270601 11.57355
## 10 2.205028 2.205028 13.85261
## 11 2.166041 2.166041 11.91558
## 12 -2.146474 2.146474 12.07249
## 13 -2.146474 2.146474 12.07249
##
## Anderson-Darling normality test
##
## data: dat_resid_small$RESID
## A = 0.3526, p-value = 0.4646
The small model satisfies the conditions for the Gaussian assumption.
The small model also appears to satisfy the diagnostic checks for homogeneity and linearity. Existence is trivial, but we may want to discuss limitations of the independence assumption (students may form study groups, or grades may be based on a curve).
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + studytime +
## failures + schoolsupyes + absences + health + goout + studytime *
## schoolsupyes + studytime * goout, data = student_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1336 -1.9368 -0.1505 1.8851 8.2491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4070 1.2565 9.078 < 2e-16 ***
## Mjobhealth 1.7048 0.5297 3.218 0.001412 **
## Mjobservices 1.3989 0.3436 4.071 5.81e-05 ***
## Fjobteacher 2.1243 0.5691 3.733 0.000222 ***
## studytime 1.5558 0.5493 2.832 0.004892 **
## failures -1.0943 0.2307 -4.743 3.08e-06 ***
## schoolsupyes 0.3178 1.1839 0.268 0.788499
## absences -0.4045 0.1303 -3.104 0.002070 **
## health -0.2267 0.1058 -2.143 0.032782 *
## goout 0.1789 0.3483 0.514 0.607747
## studytime:schoolsupyes -1.2951 0.5322 -2.433 0.015470 *
## studytime:goout -0.3380 0.1668 -2.026 0.043501 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.761 on 345 degrees of freedom
## Multiple R-squared: 0.2909, Adjusted R-squared: 0.2683
## F-statistic: 12.86 on 11 and 345 DF, p-value: < 2.2e-16
## RESID ABS PRED
## 1 3.055590 3.055590 10.75086
## 2 2.656816 2.656816 10.76960
## 3 2.548899 2.548899 12.10298
## 4 2.456520 2.456520 12.30675
## 5 2.366434 2.366434 10.59299
## 6 2.277013 2.277013 12.89385
## 7 -2.244737 2.244737 11.13363
## 8 -2.171386 2.171386 11.94272
## 9 2.168581 2.168581 12.27931
## 10 -2.130752 2.130752 10.72714
## 11 2.101355 2.101355 12.25781
## 12 -2.091628 2.091628 10.46858
## 13 -2.089416 2.089416 12.68832
## 14 2.080450 2.080450 12.34633
## 15 -2.077110 2.077110 11.67240
## 16 2.065044 2.065044 12.44727
## 17 -2.051670 2.051670 11.58201
## 18 2.043983 2.043983 12.57572
## 19 -2.010162 2.010162 11.34648
##
## Anderson-Darling normality test
##
## data: dat_resid$RESID
## A = 0.48213, p-value = 0.2295
##
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures +
## schoolsupyes + goout + health + absences, data = student_d_all)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1619 -1.7970 0.4408 2.7480 8.4881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.1885 0.8886 13.716 < 2e-16 ***
## Mjobhealth 2.2767 0.7659 2.972 0.00314 **
## Mjobservices 1.5218 0.4909 3.100 0.00208 **
## Fjobteacher 1.4156 0.8087 1.751 0.08083 .
## failures -2.1902 0.2891 -7.576 2.66e-13 ***
## schoolsupyes -1.2548 0.6285 -1.996 0.04659 *
## goout -0.4418 0.1910 -2.313 0.02123 *
## health -0.1973 0.1523 -1.295 0.19609
## absences 0.2879 0.1885 1.527 0.12755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.164 on 386 degrees of freedom
## Multiple R-squared: 0.1907, Adjusted R-squared: 0.1739
## F-statistic: 11.37 on 8 and 386 DF, p-value: 1.699e-14
For influence, we examine Cook’s distance: values near 1 flag influential observations.
For outliers, we examine leverage values: values greater than 2p/n flag high-leverage points.
We may want to remove flagged outliers and run the regression again.
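These diagnostics are available directly in base R; a sketch on a built-in dataset:

```r
# Fit a simple model to illustrate influence and leverage diagnostics.
fit <- lm(dist ~ speed, data = cars)

cd  <- cooks.distance(fit)  # influence: values near 1 are concerning
lev <- hatvalues(fit)       # leverage
p   <- length(coef(fit))
n   <- nrow(cars)

which(cd > 1)           # influential points
which(lev > 2 * p / n)  # high-leverage points by the 2p/n rule
```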
We decided to further remove a few outliers informally to increase the accuracy of our analysis. For example, we noticed that there were only five students in the 20-22 age range, whereas ages 15-18 each had 70+ students and age 19 had 19 students. For students aged 20+ to still be in secondary school, it is likely that they had extenuating circumstances not represented in the dataset; moreover, we simply have too few observations in this range to accurately estimate the effect of age on the outcome there. We therefore removed these students.
##
## 0 1 2 3 4 5 6 7
## 76 97 90 70 19 3 1 1
Aside from the adjusted R², we can compare models using AIC or SBC/BIC (smaller is better in both cases). More or different models can be added to this comparison, but here is the basis:
| model | Parameters | AICs | BICs | Adjusted.R.Squared |
|---|---|---|---|---|
| Model 1: Full Model | 37 | 1782.011 | 1929.365 | 0.2542165 |
| Model 2: No Intercept | 36 | 1834.145 | 1977.621 | 0.9370466 |
| Model 3: Stepwise Model | 14 | 1749.316 | 1807.482 | 0.2778046 |
| Model 4: Stepwise Model + Interaction Terms | 18 | 1744.031 | 1817.708 | 0.2959758 |
| Model 5: Reduced Stepwise Model | 7 | 1771.009 | 1802.031 | 0.2178292 |
| Model 6: Reduced Interaction Model | 12 | 1752.077 | 1802.488 | 0.2682632 |
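A comparison table like the one above can be assembled from base R's `AIC`, `BIC`, and `summary` functions; a sketch with toy models on a built-in dataset:

```r
# Two competing models on a built-in dataset.
m_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
m_small <- lm(mpg ~ wt, data = mtcars)

# Fit criteria side by side (smaller AIC/BIC is better; larger adj. R^2 is better).
data.frame(model  = c("big", "small"),
           AIC    = c(AIC(m_big), AIC(m_small)),
           BIC    = c(BIC(m_big), BIC(m_small)),
           adj_R2 = c(summary(m_big)$adj.r.squared, summary(m_small)$adj.r.squared))
```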
[1] P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
[2] Eurostat, 2007. Early school-leavers. http://epp.eurostat.ec.europa.eu/.